Chapter 3 - Linear Regression
This chapter is about linear regression, a very simple approach for supervised learning. Linear regression is a useful and widely used statistical learning method.
We use linear regression for predicting a _____________ response.
Simple Linear Regression
A very straightforward approach for predicting a _____________ response \(Y\) on the basis of a _____________.
It assumes that there is approximately a linear relationship between \(X\) and \(Y\). Mathematically, we can write this linear relationship as
We will sometimes describe this relationship by saying that we are regressing \(Y\) on \(X\) (or \(Y\) onto \(X\)).
Example 1:
Recall the Advertising data from Chapter 2. In this data, sales (in thousands of units) for a particular product and advertising budgets (in thousands of dollars) for TV, radio, and newspaper media are recorded.
AdvertisingData <- read.csv("https://raw.githubusercontent.com/nguyen-toan/ISLR/master/dataset/Advertising.csv", header = TRUE, sep = ",")
AdvertisingData <- AdvertisingData[,-1]
head(AdvertisingData)
## TV Radio Newspaper Sales
## 1 230.1 37.8 69.2 22.1
## 2 44.5 39.3 45.1 10.4
## 3 17.2 45.9 69.3 9.3
## 4 151.5 41.3 58.5 18.5
## 5 180.8 10.8 58.4 12.9
## 6 8.7 48.9 75.0 7.2
Let \(X\) represent TV advertising and \(Y\) represent sales. Write the model for regressing sales onto TV:
____ and _____ are two unknown constants that represent the ____ and ____ terms in the linear model.
Together, ____ and ____ are known as the model ____ or ____.
Once we have used our training data to produce estimates (how?) ____ and ____ for the model coefficients, we can predict future sales on the basis of a particular value of TV advertising by computing
where ____ indicates a prediction of \(Y\) on the basis of \(X = x\). Here we use a hat symbol to denote the estimated value for an unknown parameter or coefficient, or to denote the predicted value of the response.
Estimating the Coefficients
Note: These scatter plots here are NOT the actual ones from Advertising data.
What is the line of best fit? _____________________
- \(y =\)
- \(\hat{y} =\)
- \(y - \hat{y} =\)
We get the LS line (estimates for the model coefficients) by minimizing the sum of squares of residuals (RSS)
\[RSS = \]
Note: How do we minimize RSS? Derivatives are used to minimize RSS (outside the scope of our class).
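The idea that the LS line minimizes RSS can be checked numerically. The following is a minimal sketch on simulated data (not the Advertising data); the variable names, the simulated coefficients, and the helper `rss()` are assumptions made for illustration only. It shows that moving away from the `lm()` estimates increases RSS, and that the partial derivatives of RSS are numerically zero at those estimates.

```r
# Simulated data (hypothetical values, not the Advertising data)
set.seed(1)
x <- runif(100, min = 0, max = 300)
y <- 7 + 0.05 * x + rnorm(100, sd = 3)

# Helper (hypothetical name): RSS for a candidate intercept b0 and slope b1
rss <- function(b0, b1) sum((y - (b0 + b1 * x))^2)

# Least squares estimates from lm()
fit <- lm(y ~ x)
b0_hat <- unname(coef(fit)[1])
b1_hat <- unname(coef(fit)[2])

# RSS at the LS estimates versus RSS for two slightly different lines
rss_min <- rss(b0_hat, b1_hat)
rss_min < rss(b0_hat + 0.5, b1_hat)   # TRUE: shifting the intercept increases RSS
rss_min < rss(b0_hat, b1_hat + 0.01)  # TRUE: changing the slope increases RSS

# Partial derivatives of RSS with respect to b0 and b1, evaluated at the LS estimates
res <- y - (b0_hat + b1_hat * x)
grad <- c(-2 * sum(res), -2 * sum(x * res))
grad  # both entries are zero up to rounding error
```

Because RSS is a quadratic function of the two coefficients, a point where both partial derivatives vanish is its minimum.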
After minimizing RSS we get our estimates ____ and ____ for the model coefficients:
The standard errors associated with \(\hat{\beta_0}\) and \(\hat{\beta_1}\)
In general, \(\sigma^2\) is not known, but can be estimated from the data. The estimate of \(\sigma\) is known as the __________, and is given by the formula
Example 2:
- From Advertising data it is found that:
- The average cost for TV advertising is 147.0425
- The standard deviation of cost for TV advertising is 85.8542363
- The average sales is 14.0225
- The standard deviation of sales is 5.2174566
- Correlation coefficient between the cost for TV advertising and sales is 0.7822244
Find the LS estimates for model coefficients (\(\hat{\beta_0}\), \(\hat{\beta_1}\)) when regressing sales onto TV.
- Use the lm() function to find the LS estimates for model coefficients (\(\hat{\beta_0}\), \(\hat{\beta_1}\)) when regressing sales onto TV. Compare your answers with part a).
- State your final simple linear regression model
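The link between summary statistics and the LS estimates can be illustrated on simulated data. This is a sketch under stated assumptions: the data are simulated (not the Advertising data) and all variable names are hypothetical. It uses the standard identities that the slope equals the correlation times the ratio of standard deviations, and that the LS line passes through the point of means.

```r
# Simulated data (hypothetical values, not the Advertising data)
set.seed(2)
x <- runif(200, min = 0, max = 300)
y <- 7 + 0.05 * x + rnorm(200, sd = 3)

# LS estimates from summary statistics only
r <- cor(x, y)
b1_hat <- r * sd(y) / sd(x)           # slope = r * s_y / s_x
b0_hat <- mean(y) - b1_hat * mean(x)  # the line passes through (mean(x), mean(y))

# LS estimates from lm() for comparison
fit <- lm(y ~ x)
c(b0_hat, b1_hat)
coef(fit)
```

The two sets of estimates agree up to rounding error, so the five summary statistics are enough to recover the fitted line.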
Interpreting regression coefficients
\(\hat{\beta_1}:\) The average increase/decrease in \(Y\) for every one unit increase in \(X\).
\(\hat{\beta_0}:\) The average value of \(Y\) when the value of \(X = 0\). In most cases, we will find no meaning in \(\hat{\beta_0}\).
Example 3: In the advertising data, the sales are recorded in thousands of units and the advertising costs are recorded in thousands of dollars. Interpret the model coefficient estimates you obtained in Example 2.
Confidence intervals for model coefficients \(\beta_1\) and \(\beta_0\)
A \(100(1-\alpha)\%\) (example: 95%) confidence interval is defined as a range of values such that with \(100(1-\alpha)\%\) (example: 95%) probability, the range will contain the true unknown value of the parameter.
For linear regression, the 95% confidence interval for \(\beta_1\) takes the form
That is, there is approximately a 95% chance that the interval
will contain the true value of \(\beta_1\).
Here are some common critical values \(z^*\).
| \(100(1-\alpha)\%\) | \(90\%\) | \(95\%\) | \(99\%\) |
|---|---|---|---|
| \(z^*\) | 1.645 | 1.96 | 2.576 |
Similarly, a confidence interval for \(\beta_0\) approximately takes the form
Example 4:
The table below provides details of the least squares model for the regression of number of units sold on TV advertising budget for the Advertising data.
| | Coefficient | Std. error | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | 7.0325 | 0.4578 | 15.36 | < 0.0001 |
| TV | 0.0475 | 0.0027 | 17.67 | < 0.0001 |
- Find a 90% confidence interval for \(\beta_0\)
- Find a 90% confidence interval for \(\beta_1\)
Example 5: Use R to find the following when regressing Sales onto TV in the Advertising data.
- A 90% confidence interval for \(\beta_0\)
- A 90% confidence interval for \(\beta_1\)
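One way to obtain such intervals in R is sketched below on simulated data (not the Advertising data); the variable names are hypothetical. It builds the approximate interval "estimate plus or minus \(z^*\) times standard error" by hand and compares it with `confint()`, which uses a \(t\) quantile with \(n-2\) degrees of freedom instead of \(z^*\).

```r
# Simulated data (hypothetical values, not the Advertising data)
set.seed(3)
x <- runif(200, min = 0, max = 300)
y <- 7 + 0.05 * x + rnorm(200, sd = 3)
fit <- lm(y ~ x)

# Estimates and standard errors from the coefficient table
est <- coef(summary(fit))[, "Estimate"]
se  <- coef(summary(fit))[, "Std. Error"]

# Approximate 90% intervals using the normal critical value (about 1.645)
z_star <- qnorm(0.95)
approx_ci <- cbind(lower = est - z_star * se, upper = est + z_star * se)

# Intervals based on the t-distribution, as computed by confint()
exact_ci <- confint(fit, level = 0.90)

approx_ci
exact_ci
```

With a sample this large the two sets of intervals are nearly identical; the `confint()` intervals are slightly wider because the \(t\) quantile is slightly larger than \(z^*\).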
Hypothesis testing for \(\beta_0\) and \(\beta_1\)
The most common hypothesis test involves testing the null hypothesis against the alternative hypothesis
\[H_0: \text{}\] \[H_a: \text{}\]
Mathematically, this corresponds to testing
\[H_0: \] \[H_a: \] (since if _______ then the model \(Y = \beta_0 + \beta_1X + \epsilon\) reduces to ________, and \(X\) is not associated with \(Y\).)
We usually use four steps to conduct a hypothesis test:
- State the null and alternative hypothesis:
- Calculate the test statistic: (get/calculate this value from the \(R\) output)
Under \(H_0\), this test statistic has a \(t\)-distribution with \(n-2\) degrees of freedom. The \(t\)-distribution has a bell shape and for values of \(n\) greater than approximately 30 it is quite similar to the normal distribution.
- Find the \(p\)-value: (get this value from the \(R\) output)
- Make the decision:
If the \(p\)-value < the given cutoff (\(\alpha\)) level, we reject \(H_0\). Then we say that there is enough evidence to conclude that a linear relationship exists between \(X\) and \(Y\).
If the \(p\)-value > the given cutoff (\(\alpha\)) level, we do not reject \(H_0\). Then we say that there is not enough evidence to conclude that a linear relationship exists between \(X\) and \(Y\).
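The four steps above can be sketched in R as follows. This is an illustration on simulated data (not the Advertising data), and the variable names and the cutoff \(\alpha = 0.05\) are assumptions. The test statistic is taken to be the slope estimate divided by its standard error, which is the `t value` that `summary()` reports.

```r
# Simulated data (hypothetical values, not the Advertising data)
set.seed(4)
n <- 50
x <- runif(n, min = 0, max = 300)
y <- 7 + 0.05 * x + rnorm(n, sd = 3)
fit <- lm(y ~ x)

# Step 1: H0: beta_1 = 0 versus Ha: beta_1 != 0

# Step 2: test statistic = estimate / standard error
coefs <- coef(summary(fit))
t_stat <- coefs["x", "Estimate"] / coefs["x", "Std. Error"]

# Step 3: two-sided p-value from the t-distribution with n - 2 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Step 4: decision at the chosen cutoff (alpha = 0.05 is an assumption)
alpha <- 0.05
decision <- if (p_value < alpha) "reject H0" else "do not reject H0"

c(t_stat = t_stat, p_value = p_value)
decision
```

The hand-computed values match the `t value` and `Pr(>|t|)` columns of `summary(fit)`, so in practice both can be read directly from the R output.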
Example 6:
The table below provides details of the least squares model for the regression of number of units sold on TV advertising budget for the Advertising data.
| | Coefficient | Std. error | t-statistic | p-value |
|---|---|---|---|---|
| Intercept | 7.0325 | 0.4578 | 15.36 | < 0.0001 |
| TV | 0.0475 | 0.0027 | 17.67 | < 0.0001 |
Perform a complete four-step hypothesis test to check whether a linear relationship exists between TV advertising budget and Sales.
- State the null and alternative hypothesis:
- Calculate the test statistic: (get/calculate this value from the \(R\) output)
- Find the \(p\)-value: (get this value from the \(R\) output)
- Make the decision:
Example 7:
Use \(R\) to perform a complete four-step hypothesis test to check whether a linear relationship exists between TV advertising budget and Sales.
Assessing the Accuracy of the Model
The quality of a linear regression fit is typically assessed using two related quantities:
1. Residual Standard Error (RSE)
The RSE is considered a measure of the lack of fit of the model to the data.
$$RSE = $$
If the predictions obtained using the model are very close to the true outcome values—that is, if _________ then RSE will be _________, and we can conclude that the model fits the data very well.
On the other hand, if ___ is very far from ____ for one or more observations, then the RSE may be _________, indicating that the model doesn’t fit the data well.
The units of the RSE are the same as the units of the \(y\) variable.
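As a rough illustration, the RSE can be computed by hand and compared with the value R reports. The sketch below uses simulated data (not the Advertising data) with hypothetical variable names, and assumes the usual simple linear regression definition of the RSE as the square root of RSS divided by \(n-2\).

```r
# Simulated data (hypothetical values, not the Advertising data)
set.seed(5)
n <- 100
x <- runif(n, min = 0, max = 300)
y <- 7 + 0.05 * x + rnorm(n, sd = 3)
fit <- lm(y ~ x)

# RSE by hand: square root of RSS divided by the residual degrees of freedom
rss <- sum(residuals(fit)^2)
rse <- sqrt(rss / (n - 2))

rse
summary(fit)$sigma  # the "Residual standard error" printed by summary(fit)
```

Both values are in the units of `y`, which is why the RSE is read as a typical size of the deviation from the fitted line.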
Example 8: In the case of the advertising data, find the RSE using R. Interpret this value.
Actual _____ in each market deviate from the true regression line by approximately _____ units, on average.
2. \(R^2\) Statistic
The \(R^2\) statistic provides an alternative measure of fit.
\[R^2 = \] where \(TSS = \sum(y_i - \bar{y})^2\) is the total sum of squares.
TSS measures the total variance in the response \(Y\), and can be thought of as the amount of variability inherent in the response before the regression is performed.
Note that: \(0 \leq R^2 \leq 1\)
How to interpret \(R^2\): \(R^2\) measures the proportion of variability in \(Y\) that can be explained using \(X\).
An \(R^2\) statistic that is close to 1 indicates that a large proportion of the variability in the response has been explained by the regression.
An \(R^2\) statistic near 0 indicates that the regression did not explain much of the variability in the response.
In the simple linear regression setting, \(R^2 = r^2\). Here \(r\) is the sample correlation between \(X\) and \(Y\).
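The relationships among RSS, TSS, \(R^2\), and the sample correlation can be checked numerically. The following is a minimal sketch on simulated data (not the Advertising data) with hypothetical variable names, assuming the usual definition of \(R^2\) as one minus the ratio of RSS to TSS.

```r
# Simulated data (hypothetical values, not the Advertising data)
set.seed(6)
x <- runif(100, min = 0, max = 300)
y <- 7 + 0.05 * x + rnorm(100, sd = 3)
fit <- lm(y ~ x)

# R^2 by hand from the residual and total sums of squares
rss <- sum(residuals(fit)^2)
tss <- sum((y - mean(y))^2)
r_squared <- 1 - rss / tss

r_squared
summary(fit)$r.squared  # "Multiple R-squared" in summary(fit)
cor(x, y)^2             # squared sample correlation
```

All three numbers agree up to rounding error, which holds only in the simple linear regression setting with a single predictor.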
Example 9: In the case of the advertising data, find the \(R^2\) using R. Interpret this value.
Using R, verify that \(R^2 = r^2\).
Multiple Linear Regression (MLR)
library(plotly)
p <- plot_ly(data = AdvertisingData, z = ~Sales, x = ~TV, y = ~Radio, opacity = 0.6, color = AdvertisingData$Sales) %>%
add_markers()
p
Now we have \(p\) distinct predictors (not only one as in simple linear regression). Then the multiple linear regression model takes the form
$$$$
where \(X_j\) represents the \(j\)th predictor and \(\beta_j\) quantifies the association between that variable and the response.
- How to interpret the \(\beta_j\) values: We interpret \(\beta_j\) as the average effect on \(Y\) for a one unit increase in \(X_j\), holding all other predictors fixed.
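The "holding all other predictors fixed" interpretation can be illustrated with a small sketch. It uses simulated data with two hypothetical predictors `x1` and `x2` (not the Advertising data); the data frame `sim` and all coefficient values are assumptions. Two new observations differ by one unit in `x1` and share the same `x2`, so the difference in their predictions equals the estimated coefficient of `x1`.

```r
# Simulated data with two predictors (hypothetical values, not the Advertising data)
set.seed(7)
n <- 200
sim <- data.frame(x1 = runif(n, min = 0, max = 300),
                  x2 = runif(n, min = 0, max = 50))
sim$y <- 3 + 0.05 * sim$x1 + 0.2 * sim$x2 + rnorm(n, sd = 1.5)

# Multiple linear regression of y on both predictors
fit <- lm(y ~ x1 + x2, data = sim)
coef(fit)

# Two observations: x1 differs by one unit, x2 is held fixed
new_obs <- data.frame(x1 = c(100, 101), x2 = c(20, 20))
pred <- predict(fit, newdata = new_obs)

diff(pred)        # change in the predicted response
coef(fit)["x1"]   # estimated coefficient of x1: the same number
```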
Example 10: In the case of the advertising data, write the MLR model using all the available predictors (TV, Radio, Newspaper) to predict the response Sales.
Estimating the Regression Coefficients
As was the case in the simple linear regression setting, the regression coefficients _________ in the MLR model are unknown, and must be estimated. Given estimates _________ we can make predictions using the formula
Example 11: In the case of the advertising data:
- Find the MLR model using all the available predictors to predict the response Sales in R.
- Write the MLR equation.
- Interpret each coefficient.
Correlation matrix
A correlation matrix is a table of correlation coefficients for a set of variables, used to determine whether a relationship exists between the variables. Each coefficient indicates both the strength of the relationship and its direction (positive vs. negative correlation).
# method 1
library(corrplot)
AdvertisingData.cor <- cor(AdvertisingData, method = "spearman")
corrplot(AdvertisingData.cor)
# method 3
library("PerformanceAnalytics")
chart.Correlation(AdvertisingData, histogram = TRUE, pch = 19)